					April 1, 1973

	A Proposal for Speech Understanding Research


	It  is  proposed  that the work on speech recognition that is
now under way in the A.I. project at Stanford University be continued
and  extended  as a separate project with broadened aims in the field
of speech understanding.

	It is further proposed that this work be more closely tied to
the ARPA Speech Understanding Research groups than it has been in the
past and that it have as its express aim the study and application to
speech recognition of a machine learning process that has proved
highly successful in another application and that  has  already  been
tested  out  to a limited extent in speech recognition.   The machine
learning process offers both an automatic  training  scheme  and  the
inherent  ability  of  the  system  to  adapt to various speakers and
dialects. Speech recognition via machine learning represents a global
approach  to  the  speech recognition problem and can be incorporated
into a wide class of limited vocabulary systems. Ultimately we  would
like  to  have  a  system  capable  of  understanding  speech from an
unlimited domain of discourse and with unknown speakers. It seems not
unreasonable  to  expect  the system to deal with this situation very
much as people do when they adapt their  understanding  processes  to
the speaker's idiosyncrasies during the conversation.

	With  so  much  of  the  current work on speech understanding
being devoted to the development of systems designed  to  work  in  a
limited  field of discourse and with a limited number of speakers, it
seems desirable for a minimal program to be continued that is not  so
restricted. It is felt that we should not lose sight of those aspects
of the problem that are for the moment peripheral to the immediate
aims  of  developing  the  best complete system that can currently be
built.  Stanford University is well suited as the site for such work,
having  both  the facilities for this work and a staff of people with
experience and interest in machine learning, phonetic  analysis,  and
digital signal processing.

	The  initial  thrust of the proposed work would be toward the
development of adaptive  learning  techniques,  using  the  signature
table method and some more recent variants and extensions of this
basic procedure. We have already demonstrated the usefulness of  this
method  for  the  initial  assignment  of significant features to the
acoustic signals. One of the next steps will be to extend the  method
to  include  acoustic-phonetic probabilities in the decision process.
Finally we would hope to take account  of  syntactical  and  semantic
constraints in a somewhat analogous fashion.

	Still  another  aspect  to  be studied would be the amount of
preprocessing that should be done and  the  desired  balance  between
bottom-up  and  top-down  approaches.    It  is  fairly  obvious that
decisions of this sort should ideally be  made  adaptively  depending
upon  the  familiarity  of  the  system  with  the  current domain of
discourse and  with  the  characteristics  of  the  current  speaker.
Compromises  will  undoubtedly  have  to  be  made in any immediately
realizable system, but we should understand better than we now do the
limitations on the system that such compromises impose.

	Finally, we would propose accepting responsibility for keeping
other related projects supplied with operating versions of the best
current programs that we have developed for interfacing the digitized
speech output, or a frequency-domain representation of this input, to
the rest of the overall system.

	It may be well at this point to describe the general
philosophy that has been followed in the work that is currently under
way  and  the results that have been achieved to date.   We have been
studying  elements  of  a  speech  recognition  system  that  is  not
dependent upon the use of a limited vocabulary and that can recognize
continuous speech by a number of different speakers.

	Such a system should be able to function successfully  either
without any previous training for the specific speaker in question or
after a short training session in which the speaker would be asked to
repeat certain phrases designed to train the system on those phonetic
utterances that seemed to depart from the previously learned norm. In
either  case  it  is  believed  that some automatic or semi-automatic
training system should be employed to acquire the data that  is  used
for the identification of the phonetic information in the speech.  We
believe that this can best be done by employing a modification of the
signature table scheme previously described. A brief review of this
earlier form of signature table is given in Appendix 1.

	The over-all system is envisioned as one in which the more or
less  conventional method is used of separating the input speech into
short  time  slices  for  which  some  sort  of  frequency  analysis,
homomorphic,  LPC,  or  the  like,  is done.   We then interpret this
information in terms of significant features by means  of  a  set  of
signature  tables.   At  this  point we define longer sections of the
speech called EVENTS which are obtained by grouping together varying
numbers of the original slices on the basis of their similarity. This
then takes the place of other forms of initial segmentation.   Having
identified  a series of EVENTS in this way we next use another set of
signature tables to extract information from the sequence  of  events
and  combine  it  with  a  limited  amount  of syntactic and semantic
information to define a sequence of phonemes.
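
	The grouping of slices into EVENTS described above might be
sketched as follows, in present-day Python; the distance measure and
threshold are illustrative assumptions, since no particular
similarity measure is fixed here.

    # A minimal sketch: group consecutive feature slices into EVENTS
    # whenever they remain sufficiently similar.  The distance
    # function and threshold are hypothetical.
    def group_into_events(slices, threshold=0.5):
        """slices: one feature vector (list of numbers) per slice.
        Returns (start_index, length, mean_features) tuples."""
        def distance(a, b):
            return max(abs(x - y) for x, y in zip(a, b))
        events = []
        start = 0
        for i in range(1, len(slices) + 1):
            if (i == len(slices) or
                    distance(slices[i], slices[start]) > threshold):
                group = slices[start:i]
                mean = [sum(col) / len(group) for col in zip(*group)]
                events.append((start, len(group), mean))
                start = i
        return events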


	Signature  tables  can  be  used  to  perform  four essential
functions that are required in the automatic recognition  of  speech.
These   functions  are:   (1)  the  elimination  of  superfluous  and
redundant  information  from  the  acoustic  input  stream,  (2)  the
transformation  of  the  remaining  information  from  one coordinate
system to a more phonetically meaningful coordinate system,  (3)  the
mixing  of  acoustically  derived  data  with syntactic, semantic and
linguistic information to obtain the desired recognition, and (4) the
introduction of a learning mechanism.

	The  following  three  advantages  emerge from this method of
training and evaluation.
	1)  Essentially  arbitrary  inter-relationships  between  the
input terms are taken into account by any one table. The only loss of
accuracy is in the quantization.
	2) The training is a  very  simple  process  of  accumulating
counts.   The training samples are introduced sequentially, and hence
simultaneous storage of all the samples is not required.
	3)  The  process  linearizes  the storage requirements in the
parameter space.

	The signature tables, as used in speech recognition, must be
particularized to allow for the multi-category nature of the output.
Several forms of tables have been investigated. Details of the current
system are given in Appendix 2. Some results are summarized in an
attached report.


	Work  is  currently  under  way  on a major refinement of the
signature table  approach  which  adopts  a  somewhat  more  rigorous
procedure.    Preliminary  results  with  this scheme indicate that a
substantial improvement has been achieved.


		Appendix 1

	The early form of a signature table

	For  those  not  familiar with the use of signature tables as
used by Samuel in programs which played the  game  of  checkers,  the
concept is best illustrated (Fig. 1) by an arrangement of tables used
in the program.  There are 27 input  terms.  Each  term  evaluates  a
specific  aspect  of  a  board  situation  and it is quantized into a
limited but adequate range of values, 7, 5, and 3 in this case. The
terms are divided into 9 sets with 3 terms each, forming the 9 first
level tables. Outputs from the first level tables are quantized to 5
levels and combined into 3 second level tables and, finally, into one
third-level table whose output represents the figure of merit of the
board in question.
	A signature table has an entry for every possible combination
of  the  input vector. Thus there are 7*5*3 or 105 entries in each of
the first level tables. Training consists of accumulating two  counts
for  each  entry  during a training sequence.  Count A is incremented
when the current input vector represents a preferred move and count D
is incremented when it is not the preferred move. The output from the
table is computed as a correlation coefficient

			C = (A - D)/(A + D).

The figure of merit for a board is simply the coefficient obtained as
the output from the final table.
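
	A minimal sketch of this earlier table form, in present-day
Python (the indexing scheme and names are illustrative; the original
program was not, of course, written in Python):

    # A first-level signature table: three inputs quantized to 7, 5,
    # and 3 levels give 7*5*3 = 105 entries, each holding counts A
    # (preferred move) and D (not preferred).
    class SignatureTable:
        def __init__(self, ranges=(7, 5, 3)):
            self.ranges = ranges
            size = 1
            for r in ranges:
                size *= r
            self.A = [0] * size
            self.D = [0] * size

        def index(self, inputs):
            # Mixed-radix index for one input vector.
            i = 0
            for v, r in zip(inputs, self.ranges):
                i = i * r + v
            return i

        def train(self, inputs, preferred):
            # Accumulate counts during a training sequence.
            if preferred:
                self.A[self.index(inputs)] += 1
            else:
                self.D[self.index(inputs)] += 1

        def output(self, inputs):
            # Correlation coefficient C = (A - D)/(A + D); taken as
            # 0 for an entry that has never been trained.
            i = self.index(inputs)
            total = self.A[i] + self.D[i]
            return (self.A[i] - self.D[i]) / total if total else 0.0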


		Appendix 2

	Initial Form of Signature Table for Speech Recognition

	The signature tables, as used in speech recognition, must be
particularized to allow for the multi-category nature of the output.
Several  forms  of  tables  have  been investigated. The initial form
tested and used for the data presented  in  the attached  paper  uses
tables  consisting of two parts, a preamble and the table proper. The
preamble contains: (1) space for saving a record of the  current  and
recent  output reports from the table, (2) identifying information as
to the specific type of table, (3) a parameter  that  identifies  the
desired  output  from  the  table  and  that  is used in the learning
process, (4) a gating parameter specifying the input that is to be
used to gate the table, (5) the gating level to be used, and (6)
parameters that identify the sources of the normal inputs to the
table.
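
	A rough sketch of this layout in present-day Python; the field
names are illustrative, and the body is simplified to two counts and
one summarized output per entry (the 64-entry body matches the table
size discussed below).

    from dataclasses import dataclass, field

    @dataclass
    class TablePreamble:
        recent_outputs: list = field(default_factory=list)  # (1)
        table_type: str = ""        # (2) identifying information
        desired_output: int = 0     # (3) target used in learning
        gate_input: int = -1        # (4) input used to gate the table
        gate_level: int = 0         # (5) gating level to be used
        input_sources: list = field(default_factory=list)   # (6)

    @dataclass
    class SpeechSignatureTable:
        preamble: TablePreamble
        # One entry per input combination: learning counts plus the
        # summarized output reported at run time.
        counts: list = field(
            default_factory=lambda: [[0, 0] for _ in range(64)])
        outputs: list = field(default_factory=lambda: [0] * 64)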

	All  inputs  are  limited  in  range  and  specify either the
absolute level of some basic property or, more usually, the probability
of some property being present. These inputs may be from the original
acoustic input or they may be the outputs of other  tables.  If  from
other tables, they may be for the current time step or for earlier
time steps (subject to practical limits as to the number of time
steps that are saved).

	The output, or outputs, from each table are similarly limited
in  range  and  specify,  in  all  cases,  a  probability  that  some
particular significant feature, phonette, phoneme, word segment, word
or phrase is present.

	We are limiting the range of inputs  and  outputs  to  values
specified  by  3  bits  and  the  number  of  entries per table to 64
although this choice of values  is  a  matter  to  be  determined  by
experiment.  We  are  also  providing  for any of the following input
combinations: (1) one input of 6 bits, (2) two inputs of 3 bits each,
(3) three inputs of 2 bits each, and (4) six inputs of 1 bit each.
The uses to which these different forms are put will be described
later.
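
	All four combinations address the same 64-entry table body, as
the following sketch shows (the packing code is an assumption about
the mechanism):

    # Pack 1 x 6-bit, 2 x 3-bit, 3 x 2-bit, or 6 x 1-bit inputs into
    # one 6-bit table index in the range 0..63.
    def pack_index(inputs, bits_per_input):
        index = 0
        for v in inputs:
            assert 0 <= v < (1 << bits_per_input)
            index = (index << bits_per_input) | v
        return index

    # Examples: pack_index((5,), 6) == 5
    #           pack_index((3, 7), 3) == 31
    #           pack_index((1, 0, 1, 1, 0, 1), 1) == 45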

	The  body  of  each  table  contains entries corresponding to
every possible combination of  the  allowed  input  parameters.  Each
entry in the table actually consists of several parts. There are
fields assigned to accumulate counts of the occurrences of incidents
in which the specifying input values coincided with the different
desired outputs from the table, as found during previous learning
sessions, and there are fields containing the summarized results of
these learning sessions, which are used as outputs from the table.
The  outputs from the tables can then express to the allowed accuracy
all possible functions of the input parameters.

Operation in the Training Mode

	When operating in the training mode the program  is  supplied
with  a  sequence  of  stored  utterances  with accompanying phonetic
transcriptions.  Each  segment  of  the  incoming  speech  signal  is
analysed  (Fourier transforms or inverse filter equivalent) to obtain
the necessary input parameters for the lowest level tables in the
signature table hierarchy. At the same time reference is made to a
table of phonetic "hints" which prescribes the desired output from
each table corresponding to each possible phonemic input. The
signature tables are then processed.

	The processing of each  table  is  done  in  two  steps,  one
process  at each entry to the table and the second only periodically.
The first process consists of locating a single entry line within the
table  as  specified by the inputs to the table and adding a 1 to the
appropriate field to indicate the presence of the property specified
by the hint table as corresponding to the phoneme specified in the
phonemic transcription. At this time a report is also made as to  the
table's  output  as  determined from the averaged results of previous
learning so that a running record may be kept of the  performance  of
the   system.  At  periodic  intervals  all  tables  are  updated  to
incorporate recent learning results.  To  make  this  process  easily
understandable, let us restrict our attention to a table used to
identify a single significant feature, say voicing. The hint table
will identify whether or not the phoneme currently being processed is
to be considered voiced. If it is voiced, a 1 is added to  the  "yes"
field of the entry line located by the normal inputs to the table. If
it is not voiced, a 1 is added to the "no" field.  At  updating  time
the  output that this entry will subsequently report is determined by
dividing the accumulated sum in the "yes" field by  the  sum  of  the
numbers in the "yes" and the "no" fields, and reporting this quantity
as a number in the range from 0 to 7. Actually the process is  a  bit
more complicated than this and it varies with the exact type of table
under consideration, as reported in detail in Appendix 2. Outputs
from the signature tables are not probabilities, in the strict sense,
but are the statistically-arrived-at odds based on the actual
learning sequence.
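
	A simplified sketch of this training step and periodic update
for such a voicing table, ignoring the complications just mentioned:

    # entry is a two-element list [yes_count, no_count] located by
    # the normal inputs to the table.
    def train_step(entry, hint_says_voiced):
        if hint_says_voiced:
            entry[0] += 1       # add 1 to the "yes" field
        else:
            entry[1] += 1       # add 1 to the "no" field

    def update_output(entry):
        # At updating time: yes / (yes + no), reported in 0..7.
        yes, no = entry
        return round(7 * yes / (yes + no)) if yes + no else 0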

	The preamble of the table has space for storing twelve past
outputs. An input to a table can be delayed to that extent. This
table relates outcomes of previous events to the present hint, the
learning input. A certain amount of context-dependent learning is
thus possible, with the limitation that the specified delays are
constant.

	The interconnected hierarchy of tables forms a network which
runs incrementally, in steps synchronous with the time window over
which the input signal is analysed. The present window width is set
at 12.8 ms (256 points at 20 K samples/sec) with an overlap of 6.4
ms. Inputs to this network are the parameters abstracted from the
frequency analyses of the signal, and the specified hint. The outputs
of the network could be either the probability attached to every
phonetic symbol or the output of a table associated with a feature
such as voiced, vowel, etc. The point to be made is that the output
generated for a segment is essentially independent of its contiguous
segments. The dependency achieved by using delays in the inputs is
invisible to the outputs. The outputs thus report the best estimate
of what the current acoustic input is, with no relation to past
outputs. Relating the successive outputs along the time dimension is
realised by counters.
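
	The stepping implied by the figures above may be sketched as
follows; the half-window advance gives the stated 6.4 ms overlap.

    WINDOW = 256    # samples: 256 / 20000 samples/sec = 12.8 ms
    HOP = 128       # advance per step: 6.4 ms, i.e. 50% overlap

    def windows(signal):
        # Yield successive overlapping analysis windows.
        for start in range(0, len(signal) - WINDOW + 1, HOP):
            yield signal[start:start + WINDOW]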

The Use of COUNTERS

	The transition from initial segment space to event space is
made possible by means of COUNTERS which are summed and reinitiated
whenever their inputs cross specified threshold values, being
triggered on when the input exceeds the threshold and off when it
falls below. Momentary spikes are eliminated by specifying a time
hysteresis, the number of consecutive segments for which the input
must be above the threshold. The output of a counter provides
information about the starting time, duration, and average input for
the period it was active.
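
	A minimal sketch of such a counter; the triggering and
averaging details are assumptions consistent with the description
above.

    class Counter:
        def __init__(self, threshold, hysteresis):
            self.threshold = threshold
            self.hysteresis = hysteresis  # segments needed to trigger
            self.run = 0                  # consecutive segments above
            self.total = 0.0
            self.active = False
            self.start = 0

        def step(self, t, value):
            """Feed one segment (time index t).  Returns
            (start, duration, average) when the counter turns off."""
            if value > self.threshold:
                self.run += 1
                self.total += value
                if not self.active and self.run >= self.hysteresis:
                    self.active = True
                    self.start = t - self.run + 1
                return None
            result = None
            if self.active:   # fell below the threshold: report
                result = (self.start, self.run, self.total / self.run)
            self.run, self.total, self.active = 0, 0.0, False
            return result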

	Since  a  counter  can  reference a table at any level in the
hierarchy of tables, it can reflect any desired degree of information
reduction. For example, a counter may be set up to show a section of
speech to be a vowel, a front vowel, or the vowel /I/. The counters
can be looked upon as representing a mapping of parameter-time space
into a feature-time space or, at a higher level, a symbol-time space.
It may be useful to carry along the feature information as a backup
in those situations where the symbolic information is not acceptable
to syntactic or semantic interpretation.

	In the same manner as the tables, the counters run completely
independently of each other. In a recognition run the counters may
overlap in arbitrary fashion, may leave gaps where no counter has
been triggered, or may not line up nicely. A properly segmented
output, where the consecutive sections are in time sequence and are
neatly labelled, is essential for further processing. This is
achieved by registering the instants when the counters are triggered
or terminated to form time segments called events.

	An event is the period between successive activations or
terminations of any counter. An event shorter than a specified time
is merely ignored. A record of event durations and up to three active
counters, ordered according to their probability, is maintained.
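
	A minimal sketch of this event formation; the representation
of the triggering and termination instants is an assumption.

    def form_events(instants, min_duration):
        """instants: sorted times at which any counter was triggered
        or terminated.  Returns (start, duration) pairs, ignoring
        events shorter than min_duration."""
        return [(a, b - a)
                for a, b in zip(instants, instants[1:])
                if b - a >= min_duration]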

	An event resulting from the processing described so far
represents a phonette, one of the basic speech categories defined as
hints in the learning process. It is only an estimate of closeness to
a speech category, based on past learning. Also, each category has a
more-or-less stationary spectral characterisation. Thus a category
may have a phonemic equivalent, as in the case of vowels; it may be
common to a phoneme class, as for the voiced or unvoiced stop gaps;
or it may be subphonemic, as a T-burst or a K-burst. The choices are
based on acoustic expediency, i.e. optimisation of the learning,
rather than any linguistic considerations. However, a higher level
interpretive program may best operate on inputs resembling a phonemic
transcription. The contiguous events may be coalesced into
phoneme-like units using dyadic or triadic probabilities and
acoustic-phonetic rules particular to the system. For example, a
period of silence followed by a type of burst or a short friction may
be combined to form the corresponding stop. A short friction or a
burst following a nasal or a lateral may be called a stop even if the
silence period is short or absent. Clearly these rules must be
specific to the system, based on the confidence with which durations
and phonette categories are recognised.
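
	A sketch of rules of this kind; the category labels and the
particular rules encoded are illustrative only.

    # Coalesce contiguous events into phoneme-like units, e.g. a
    # silence followed by a burst or short friction becomes a stop.
    def coalesce(events):
        out, i = [], 0
        while i < len(events):
            label, dur = events[i]
            nxt = events[i + 1] if i + 1 < len(events) else None
            if (label == "silence" and nxt and
                    nxt[0] in ("T-burst", "K-burst", "short-friction")):
                out.append(("stop", dur + nxt[1]))
                i += 2
            else:
                out.append((label, dur))
                i += 1
        return out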

	While it would be possible to extend this bottom up  approach
still  further,  it  seems  reasonable to break off at this point and
revert to a top down approach from here on. The  real  difference  in
the  overall  system  would  then be that the top down analysis would
deal with the outputs from the signature table section as its
primitives rather than with the outputs from the initial measurements
either in the time domain or in the frequency domain. In the case of
inconsistencies the system could either refer to the second choices
retained within the signature tables or, if need be, could always go
clear back to the input parameters. The decision as to how far to
carry the initial bottom-up analysis must depend upon the relative
cost of this analysis, both in complexity and processing time, and
the certainty with which it can be performed, as compared with the
costs associated with the rest of the analysis and the certainty with
which it can be performed, taking due notice of the costs in time of
recovering from false starts.